Skip to content

Flush OTEL logs with Vercel waitUntil#354

Merged
ben-fornefeld merged 1 commit into
mainfrom
fix/waituntil-log-export
Jun 5, 2026
Merged

Flush OTEL logs with Vercel waitUntil#354
ben-fornefeld merged 1 commit into
mainfrom
fix/waituntil-log-export

Conversation

@ben-fornefeld

Copy link
Copy Markdown
Member

Summary

  • update Next.js and @next/env to 16.2.7
  • wrap the OTLP batch log processor with a Vercel waitUntil-aware processor
  • add focused unit coverage for the log processor

Verification

  • bun test tests/unit/vercel-wait-until-log-record-processor.test.ts
  • bun run check (passes with existing warnings)
  • bunx next typegen
  • bunx tsc --noEmit --pretty false

Build note: bun run build was blocked locally by missing billing env values: BILLING_API_URL or NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY while NEXT_PUBLIC_INCLUDE_BILLING is enabled.

@cla-bot cla-bot Bot added the cla-signed label Jun 5, 2026
@cursor

cursor Bot commented Jun 5, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Patch Next bump and observability-only wiring; behavior outside Vercel stays delegated-only with no flush scheduling.

Overview
Bumps next and @next/env from 16.2.6 to 16.2.7 in package.json and bun.lock.

OpenTelemetry log export is wrapped in VercelWaitUntilLogRecordProcessor, which after each emitted log schedules a single forceFlush on the inner batch processor through Vercel’s waitUntil hook (via Symbol.for('@vercel/request-context')). Node instrumentation uses this wrapper around the existing OTLP batch log pipeline. Unit tests cover delegation without Vercel context and one waitUntil flush per request context.

Reviewed by Cursor Bugbot for commit 44e4931. Bugbot is set up for automated code reviews on this repo. Configure here.

@vercel

vercel Bot commented Jun 5, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
web Ready Ready Preview, Comment Jun 5, 2026 4:44am
web-juliett Ready Ready Preview, Comment Jun 5, 2026 4:44am

Request Review

@ben-fornefeld ben-fornefeld marked this pull request as ready for review June 5, 2026 05:12
@ben-fornefeld ben-fornefeld requested a review from drankou as a code owner June 5, 2026 05:12
Copilot AI review requested due to automatic review settings June 5, 2026 05:12
@ben-fornefeld ben-fornefeld merged commit 9f874e9 into main Jun 5, 2026
17 checks passed
@ben-fornefeld ben-fornefeld deleted the fix/waituntil-log-export branch June 5, 2026 05:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves OpenTelemetry log delivery reliability on Vercel by ensuring the OTLP batch log processor gets flushed via Vercel’s waitUntil, and includes a small Next.js patch update plus focused unit coverage for the new processor.

Changes:

  • Bump next and @next/env from 16.2.6 to 16.2.7.
  • Introduce VercelWaitUntilLogRecordProcessor to schedule a forceFlush() through Vercel request context waitUntil.
  • Add unit tests validating delegation behavior and “single scheduled flush while pending” semantics.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/unit/vercel-wait-until-log-record-processor.test.ts Adds unit coverage for the waitUntil-based log flushing wrapper.
src/instrumentation.node.ts Wraps the OTLP BatchLogRecordProcessor with the new Vercel waitUntil-aware processor.
src/core/server/observability/vercel-wait-until-log-record-processor.ts Implements the wrapper that schedules log flushes via Vercel request context waitUntil.
package.json Updates next and @next/env dependency versions to 16.2.7.
bun.lock Locks updated Next.js-related package versions/checksums.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +51 to +65

try {
requestContext.waitUntil(async () => {
try {
await this.delegate.forceFlush()
} catch (error) {
diag.warn(
'Failed to flush OpenTelemetry logs in Vercel waitUntil',
error
)
} finally {
this.pendingFlushContexts.delete(requestContext)
}
})
} catch (error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The dedupe set marks the context as "pending" for the entire duration of forceFlush(), but BatchLogRecordProcessor.forceFlush() snapshots its internal buffer synchronously and then awaits the OTLP export. Any log emitted during that export window goes into the next batch — but scheduleRequestFlush short-circuits (line 46-48) because the context is still in pendingFlushContexts, so no new waitUntil is registered. When the in-flight flush resolves, Vercel can terminate the runtime before the BatchLogRecordProcessor's 5s timer fires (it's unref()'d), silently dropping that log — the exact failure mode this PR aims to prevent. Fix: delete the context from pendingFlushContexts before awaiting forceFlush, so concurrent emits schedule a fresh waitUntil that picks up late additions.

Extended reasoning...

What the bug is

scheduleRequestFlush at src/core/server/observability/vercel-wait-until-log-record-processor.ts:43-72 uses pendingFlushContexts to dedupe — only one waitUntil(forceFlush) should be scheduled per request context. The intent is sound, but the dedupe flag is released too late: the context is removed from the set only in the finally block at line 62, after forceFlush() resolves. This is a classic flag-after-side-effect race.

Why it manifests

BatchLogRecordProcessor.forceFlush() (in @opentelemetry/sdk-logs) calls _flushAll(), which synchronously snapshots the buffer:

let toFlush = this._finishedLogRecords;
this._finishedLogRecords = [];   // buffer swapped to empty
// ... then async OTLP HTTP export happens

Records emitted after this snapshot land in the new (empty) buffer and are not part of the in-flight export. The OTLP HTTP POST typically takes tens to hundreds of milliseconds, which is plenty of time for additional logs to be emitted from concurrent async work (database calls, fetch instrumentation auto-spans, etc.).

Step-by-step proof

  1. Log A emitted → delegate.onEmit pushes A → scheduleRequestFlush adds requestContext to pendingFlushContexts, calls requestContext.waitUntil(asyncFn).
  2. asyncFn runs; delegate.forceFlush() is invoked. Internally, _flushAll snapshots toFlush = [A], resets the buffer to [], and starts the OTLP HTTPS POST. asyncFn is now suspended at await.
  3. The request handler continues (e.g. await db.query() or await fetch(...)). During this window, log B is emitted.
  4. delegate.onEmit(B, ctx) pushes B into the now-empty buffer. scheduleRequestFlush runs: pendingFlushContexts.has(requestContext) is true → early return at line 46-48. No new waitUntil is registered.
  5. The OTLP export for [A] resolves. _flushAll calls _maybeStartTimer which schedules a delayed flush — but the BatchLogRecordProcessor's timer is unref()'d, so it cannot keep the Vercel runtime alive.
  6. asyncFn's finally removes requestContext from pendingFlushContexts and the async function resolves.
  7. Vercel observes that all waitUntil promises have settled. The runtime is terminated.
  8. Log B is still sitting in BatchLogRecordProcessor._finishedLogRecords and is silently dropped.

This is precisely the failure mode the PR title ("flush otel logs with vercel waitUntil") is trying to prevent — under realistic conditions where logs are interleaved with async I/O, the fix is only partial.

Why existing code does not prevent it

The test at tests/unit/vercel-wait-until-log-record-processor.test.ts exercises back-to-back synchronous emissions before the callback runs (processor.onEmit(); processor.onEmit(); await callbacks[0]();). It does not cover emissions that happen during the await this.delegate.forceFlush() window — the exact case this bug describes.

Impact

In Vercel serverless functions handling typical async request workflows, logs emitted while an OTLP flush is in flight are at risk of being lost on runtime termination. The race window is the OTLP HTTP RTT (tens to hundreds of ms), easily long enough to overlap with subsequent log emissions. Auto-instrumentations like FetchInstrumentation that emit telemetry from async span completions further widen the window.

How to fix

Remove the context from pendingFlushContexts before awaiting forceFlush, so any log emitted during the export schedules a fresh waitUntil:

requestContext.waitUntil(async () => {
  this.pendingFlushContexts.delete(requestContext)
  try {
    await this.delegate.forceFlush()
  } catch (error) {
    diag.warn('Failed to flush OpenTelemetry logs in Vercel waitUntil', error)
  }
})

With this change, concurrent emits add the context back into the set and register a new waitUntil, extending the runtime lifetime until the late additions are exported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants